{% extends "../master.html" %} {% load static %} {% block title %} JupyterCrimeData {% endblock %} {% block additional_css %} {% endblock additional_css %} {% block content %}
import matplotlib.pyplot as plt
import pandas as pd
from PIL import Image
# Create a boolean condition keeping only rows where neither 'LAT' nor 'LON' is 0
condition = (y['LAT'] != 0) & (y['LON'] != 0)
# Use the loc method to select rows that meet the condition and create a new DataFrame
y_filtered = y.loc[condition]
Some of the crime records are missing locations (a lat/lon of 0), so we can safely drop those.
plt.scatter(y_filtered.LON, y_filtered.LAT)
plt.show()
This data seems very convoluted. Let's try reducing it and cleaning it. The good news is that it mirrors a picture of LA City; I'll provide an image below.
y_dat = pd.DataFrame({'LAT': y_filtered['LAT'], 'LON': y_filtered['LON']})
y_dat.head(6)
| | LAT | LON |
|---|---|---|
| 0 | 34.0141 | -118.2978 |
| 1 | 34.0459 | -118.2545 |
| 2 | 34.0448 | -118.2474 |
| 3 | 34.1685 | -118.4019 |
| 4 | 34.2198 | -118.4468 |
| 5 | 34.0452 | -118.2534 |
y_dat.plot(kind="scatter", x="LON", y="LAT", grid=True, alpha=0.01)
plt.show()
# Load the raw LAPD dataset
y = pd.read_csv('Crime_Data_from_2020_to_Present.csv')
In all honesty, it would probably be better to do random sampling; we'll do that later. For now we just want to see what the map looks like with all the data points together. Lastly, note that alpha is the transparency of our graph, with 1 being the default. At 0.01, the darker a location appears, the more often an incident is reported there.
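That random sampling later could be as simple as pandas' `sample`; here's a minimal sketch on a toy frame (on the real data the call would be something like `y_filtered.sample(frac=0.01)`).

```python
import pandas as pd

# Toy stand-in for y_filtered (the real frame has ~760k rows); values are made up.
df = pd.DataFrame({
    "LAT": [34.01, 34.05, 34.04, 34.17, 34.22, 34.05],
    "LON": [-118.30, -118.25, -118.25, -118.40, -118.45, -118.25],
})

# Keep half the rows here (1% on the real data); a seed makes it reproducible.
sample = df.sample(frac=0.5, random_state=42)
print(len(sample))  # 3
```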
LA City isn't Gotham City: as of 2021 Los Angeles had a total population of 3.849 million, with 9,974 officers and 3,000 civilian staff, over a span of 502 mi². That works out to roughly 7,667 residents per square mile, about 20 officers per square mile, or about 1 officer for every 386 residents. That would explain the density of our scatter plot, however. Also note that incidents != crime, nor does it imply that more officers are needed. We aren't doing causation or correlation for this, just pulling data.
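Running that arithmetic quickly:

```python
population = 3_849_000   # LA city population, 2021
officers = 9_974         # sworn LAPD officers
area_sq_mi = 502         # land area in square miles

residents_per_sq_mi = population / area_sq_mi
officers_per_sq_mi = officers / area_sq_mi
residents_per_officer = population / officers

print(round(residents_per_sq_mi))    # ~7667
print(round(officers_per_sq_mi))     # ~20
print(round(residents_per_officer))  # ~386
```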
In the next bit, we want a visual representation of a Map of LA vs our data. Does it reflect the city geographically? Let's take a look.
img = Image.open('detailed_map_of_los_angeles_city_and_neighborhoods.jpg')
display(img)
If you compare the scatter plot with the map of LA city, they're more or less the same. So we can safely assume the Latitude and Longitude are correct and representative of LA.
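One way to make that comparison more direct is to draw the map behind the scatter. This is only a sketch: a blank array stands in for the JPG (with the real file you'd use `np.asarray(Image.open(...))`), and the extent is taken from the data's own bounding box.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripting
import matplotlib.pyplot as plt

# Stand-in for the LA map image; replace with the real pixel array.
map_img = np.zeros((100, 100, 3), dtype=np.uint8)

# Toy coordinates standing in for y_filtered's LON/LAT columns.
lon = np.array([-118.30, -118.25, -118.40])
lat = np.array([34.01, 34.05, 34.17])

# Pin the image to the coordinate box of the data, then scatter on top.
extent = [lon.min(), lon.max(), lat.min(), lat.max()]  # [left, right, bottom, top]
fig, ax = plt.subplots()
ax.imshow(map_img, extent=extent, aspect="auto", alpha=0.5)
ax.scatter(lon, lat, s=5)
plt.close(fig)
print(extent)
```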
y.nunique(axis=0)
DR_NO             761582
Date Rptd           1294
DATE OCC            1294
TIME OCC            1439
AREA                  21
AREA NAME             21
Rpt Dist No         1199
Part 1-2               2
Crm Cd               138
Crm Cd Desc          138
Mocodes           256088
Vict Age             103
Vict Sex               5
Vict Descent          20
Premis Cd            312
Premis Desc          306
Weapon Used Cd        79
Weapon Desc           79
Status                 6
Status Desc            6
Crm Cd 1             140
Crm Cd 2             122
Crm Cd 3              35
Crm Cd 4               6
LOCATION           62698
Cross Street        9437
LAT                 5392
LON                 4966
dtype: int64
y_filtered['Vict Sex'].unique()
array(['F', 'M', 'X', nan, 'H', '-'], dtype=object)
F - Female
M - Male
X - Unknown
The Vict Sex column is of the text datatype. I am not sure about H, but I'm assuming it stands for Hamburglar.
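To see how rare the stray codes are, a `value_counts` with NaNs included does the trick; sketched here on a toy column standing in for `y_filtered['Vict Sex']`.

```python
import numpy as np
import pandas as pd

# Toy column with the same codes seen above; made-up frequencies.
vict_sex = pd.Series(['F', 'M', 'M', 'X', np.nan, 'H', '-', 'F'])

# dropna=False keeps the missing values in the tally.
counts = vict_sex.value_counts(dropna=False)
print(counts)
```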
Next up we want to parse through the Mocodes. But first we need to figure out how. Thankfully the LAPD provided an excellent spreadsheet, in the form of a PDF instead of a CSV, so we have to parse it ourselves through extra steps. You can download it from here. So let's get straight to it.
from PyPDF2 import PdfReader
# creating a pdf reader object
reader = PdfReader('MO_CODES_Numerical_20191119.pdf')
# printing number of pages in pdf file
print(len(reader.pages))
# getting a specific page from the pdf file
page = reader.pages[0]
# extracting text from page
text = page.extract_text()
print(text)
19 REV: 07/19 0100 Suspect Impersonate 0101 Aid victim 0102 Blind 0103 Physicallydisabled 0104 Customer 0105 Delivery 0106 Doctor 0107 God 0108 Infirm 0109 Inspector 0110 Involved in traffic/accident 0112 Police 0113 Renting 0114 Repair Person 0115 Returning stolen property 0116 Satan 0117 Salesman 0118 Seeking someone 0119 Sent by owner 0120 Social Security/Medicare 0121 DWP/Gas Company/Utility worker 0122 Contractor 0123 Gardener/Tree Trimmer 0200 Suspect wore disguise 0201 Bag 0202 Cap/hat 0203 Cloth (with eyeholes) 0204 Clothes of opposite sex 0205 Earring 0206 Gloves 0207 Handkerchief 0208 Halloween mask 0209 Mask 0210 Make up (males only) 0211 Shoes 0212 Nude/partly nude 0213 Ski mask 0214 Stocking 0215 Unusual clothes 0216 Suspect wore hood/hoodie 0217 Uniform 0218 Wig 0219 Mustache-Fake 0220 Suspect wore motorcycle helmetMOCODES NUMERICAL 1 of 19
We have to go through 19 pages of this. :( Time to learn how! Note that this is called PDF mining. It can be incredibly difficult: sometimes we get a mess, or sometimes the PDF comes corrupted. Whatever the case, the fact we have to go through this is unforgivable. How dare they.
import re
def extract_four_numbers_outputs(text):
    # Match a 4-digit code, then capture its description up to the next code,
    # refusing to cross the page-footer text (MOCODES / NUMERICAL / "n of 19").
    pattern = r'(?s)\b(\d{4})\s+((?:(?!\n(?:MOCODES|NUMERICAL|\d+ of \d+))[\s\S])*?)(?=\s*\b\d{4}\b|$)'
    matches = re.finditer(pattern, text)
    # Strip any footer text that still leaked into a description
    result = {match.group(1): re.sub(r'\s*(?:MOCODES|NUMERICAL|\n\d+ of \d+).*', '', match.group(2)).strip()
              for match in matches}
    return result
output_dict = extract_four_numbers_outputs(text)
print(output_dict)
{'0100': 'Suspect Impersonate', '0101': 'Aid victim', '0102': 'Blind', '0103': 'Physicallydisabled', '0104': 'Customer', '0105': 'Delivery', '0106': 'Doctor', '0107': 'God', '0108': 'Infirm', '0109': 'Inspector', '0110': 'Involved in traffic/accident', '0112': 'Police', '0113': 'Renting', '0114': 'Repair Person', '0115': 'Returning stolen property', '0116': 'Satan', '0117': 'Salesman', '0118': 'Seeking someone', '0119': 'Sent by owner', '0120': 'Social Security/Medicare', '0121': 'DWP/Gas Company/Utility worker', '0122': 'Contractor', '0123': 'Gardener/Tree Trimmer', '0200': 'Suspect wore disguise', '0201': 'Bag', '0202': 'Cap/hat', '0203': 'Cloth (with eyeholes)', '0204': 'Clothes of opposite sex', '0205': 'Earring', '0206': 'Gloves', '0207': 'Handkerchief', '0208': 'Halloween mask', '0209': 'Mask', '0210': 'Make up (males only)', '0211': 'Shoes', '0212': 'Nude/partly nude', '0213': 'Ski mask', '0214': 'Stocking', '0215': 'Unusual clothes', '0216': 'Suspect wore hood/hoodie', '0217': 'Uniform', '0218': 'Wig', '0219': 'Mustache-Fake', '0220': 'Suspect wore motorcycle helmet'}
Then we do this for every page, and we'll have our MO codes. Note that I use regex here to find the text paired with each code specifically, and add the pairs to a dictionary. Was this easy? Nope! This is one of the downsides of using PDFS TO STORE DATA. But at least it works.
for i in range(1, len(reader.pages)):
    # getting a specific page from the pdf file
    page = reader.pages[i]
    # extracting text from page
    text = page.extract_text()
    output_dict.update(extract_four_numbers_outputs(text))
print(len(output_dict))
819
And there you go, we have all of our MO codes (819 to be exact). I'll walk through the regex now so you understand what we did.
pattern = r'(?s)\b(\d{4})\s+((?:(?!\n(?:MOCODES|NUMERICAL|\d+ of \d+))[\s\S])*?)(?=\s*\b\d{4}\b|$)'
what does that mean?
(?s): This is a flag that enables the "dot matches all" mode in the regular expression. It allows the dot . to match any character, including newline characters. Placing (?s) at the beginning of the pattern applies this mode to the entire regular expression.
\b: This is a word boundary anchor, which matches the empty string at the beginning or end of a word. It ensures that the pattern matches complete words only.
(\d{4}): This is a capturing group that matches four consecutive digits. It captures the four-digit number as a group, allowing us to extract it later.
\s+: This matches one or more whitespace characters (spaces, tabs, newlines, etc.) after the four-digit number.
((?:(?!\n(?:MOCODES|NUMERICAL|\d+ of \d+))[\s\S])*?): This is a capturing group that matches the content following the four-digit number. The key part of this expression is the negative lookahead inside (?:(?!\n(?:MOCODES|NUMERICAL|\d+ of \d+))[\s\S]).
(?!\n(?:MOCODES|NUMERICAL|\d+ of \d+)): This is a negative lookahead assertion, which ensures that the match stops if it encounters the following pattern:
\n: Matches a newline character.
(?:MOCODES|NUMERICAL|\d+ of \d+): This is a non-capturing group that matches three alternatives separated by |. It matches either "MOCODES", "NUMERICAL", or a pattern of the form \d+ of \d+ (where \d+ matches one or more digits). This prevents the match from including any occurrence of "\nMOCODES", "\nNUMERICAL", or "\n\d+ of \d+".
[\s\S]: This is a character class that matches any character, including newlines (because of the (?s) flag). It allows the pattern to match across multiple lines.
*?: This quantifier matches the previous group zero or more times in a non-greedy manner, which means it matches as few characters as possible.
(?=\s*\b\d{4}\b|$): This is a positive lookahead assertion that ensures the match stops at the next four-digit number or at the end of the text ($ anchor). It looks for zero or more whitespace characters (\s*) followed by a four-digit number (\d{4}) with word boundaries (\b) or the end of the text ($).
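A quick sanity check of the pattern on a toy one-line string (the newline escape is written as `\n` inside the raw string):

```python
import re

def extract_four_numbers_outputs(text):
    # Same pattern as discussed above: 4-digit code, then its description,
    # stopping before the next 4-digit code or the end of the text.
    pattern = r'(?s)\b(\d{4})\s+((?:(?!\n(?:MOCODES|NUMERICAL|\d+ of \d+))[\s\S])*?)(?=\s*\b\d{4}\b|$)'
    matches = re.finditer(pattern, text)
    return {m.group(1): re.sub(r'\s*(?:MOCODES|NUMERICAL|\n\d+ of \d+).*', '', m.group(2)).strip()
            for m in matches}

sample = "0100 Suspect Impersonate 0101 Aid victim 0102 Blind"
print(extract_four_numbers_outputs(sample))
```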
I drafted this pattern with ChatGPT's help to make quick work of it.
y_filtered.Mocodes.head(6)
0              0444 0913
1         0416 1822 1414
2                   1501
3              0329 1402
4                   0329
5    0413 1822 1262 1415
Name: Mocodes, dtype: object
y_filtered.Mocodes[0]
'0444 0913'
test = y_filtered.Mocodes[0].split(" ")
for i in range(len(test)):
    test[i] = output_dict[test[i]]
print(test)
['Pushed', 'Victim knew Suspect']
This works as expected. We won't combine the values, since the Mocodes are distinguishable enough; we'll only replace them with human-readable labels at the end, when graphs come into play.
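When that time comes, one way to decode the whole column at once is a small `apply` helper; this is a sketch with a hypothetical three-entry code table standing in for `output_dict`.

```python
import pandas as pd

# Hypothetical mini code table; the real one is output_dict built from the PDF.
code_table = {'0444': 'Pushed', '0913': 'Victim knew Suspect', '1822': 'Stranger'}

# Toy slice standing in for y_filtered.Mocodes.
mocodes = pd.Series(['0444 0913', '1822', '0444'])

def decode(cell):
    # Unknown codes are kept as-is rather than raising.
    return ' / '.join(code_table.get(c, c) for c in cell.split(' '))

print(mocodes.apply(decode))
```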
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
# Work on a copy of just the coordinate columns
y_temp = y_filtered[['LON', 'LAT']].copy()
# Casting to int truncates the coordinates to whole degrees,
# which collapses nearly all points onto a couple of grid cells
y_temp['LON'] = y_temp['LON'].astype(int)
y_temp['LAT'] = y_temp['LAT'].astype(int)
X = y_temp[['LON', 'LAT']].to_numpy()
X = resample(X, n_samples=int(len(X)*0.01))
X = StandardScaler().fit_transform(X)
import numpy as np
from sklearn import metrics
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)
Estimated number of clusters: 2
Estimated number of noise points: 0
# Function to calculate the k-distance of each point
def calculate_k_distance(X, k):
    # Compute pairwise distances
    distances = metrics.pairwise_distances(X)
    # Sort distances for each point and take the k-th smallest distance
    k_distances = np.sort(distances, axis=1)[:, k]
    return k_distances
# Calculate k-distance for each point using a chosen k value
k_value = 4
k_distances = calculate_k_distance(X, k_value)
# Plot the k-distance graph
# sort ascending so the "elbow" is visible (matching the x-axis label)
plt.plot(np.arange(len(k_distances)), np.sort(k_distances))
plt.xlabel('Points (sorted by distance)')
plt.ylabel(f'{k_value}-Distance')
plt.title(f'K-Distance Graph (k={k_value})')
plt.show()
No Comment.
y_filtered['AREA NAME'].unique()
array(['Southwest', 'Central', 'N Hollywood', 'Mission', 'Devonshire',
'Northeast', 'Harbor', 'Van Nuys', 'West Valley', 'West LA',
'Wilshire', 'Pacific', 'Rampart', '77th Street', 'Hollenbeck',
'Southeast', 'Hollywood', 'Newton', 'Foothill', 'Olympic',
'Topanga'], dtype=object)
len(y_filtered['AREA NAME'].unique())
21
Looks like we have 21 unique locations.
Upon closer inspection, it appears I ran that clustering for no apparent reason. Mistakes were made and I will willfully pretend this never happened.
incidents = y_filtered['AREA NAME'].value_counts().to_dict()
print(incidents)
{'Central': 50920, '77th Street': 48031, 'Pacific': 44262, 'Southwest': 42479, 'Hollywood': 40201, 'Southeast': 38863, 'Olympic': 38494, 'Newton': 37842, 'N Hollywood': 37800, 'Wilshire': 35877, 'Rampart': 35277, 'West LA': 34892, 'Northeast': 32875, 'Van Nuys': 32278, 'West Valley': 31834, 'Harbor': 31524, 'Topanga': 30822, 'Devonshire': 30762, 'Mission': 30229, 'Hollenbeck': 28563, 'Foothill': 25491}
max_value = max(incidents, key=lambda x:incidents[x])
print(f'{max_value}: {incidents.get(max_value)}')
Central: 50920
sorted_vals = list(incidents.values())
print(sorted_vals)
[50920, 48031, 44262, 42479, 40201, 38863, 38494, 37842, 37800, 35877, 35277, 34892, 32875, 32278, 31834, 31524, 30822, 30762, 30229, 28563, 25491]
sorted_vals.sort()
print(sorted_vals)
[25491, 28563, 30229, 30762, 30822, 31524, 31834, 32278, 32875, 34892, 35277, 35877, 37800, 37842, 38494, 38863, 40201, 42479, 44262, 48031, 50920]
sorted_keys = [key for key, value in sorted(incidents.items(), key=lambda item: item[1])]
print(sorted_keys)
['Foothill', 'Hollenbeck', 'Mission', 'Devonshire', 'Topanga', 'Harbor', 'West Valley', 'Van Nuys', 'Northeast', 'West LA', 'Rampart', 'Wilshire', 'N Hollywood', 'Newton', 'Olympic', 'Southeast', 'Hollywood', 'Southwest', 'Pacific', '77th Street', 'Central']
In terms of the locations with the least reported incidents, Foothill is our answer. Asked in reverse, Central is the most, with a whopping ~50k incidents, double that of Foothill!
In terms of when these incidents occur, we can look at the time-occurred value but not the date-occurred value; it appears date occurred might just be an upload date with wrong headers.
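TIME OCC is 24-hour military time stored as an integer, so pulling out the hour of day is just integer division; a sketch using values that appear in the table below:

```python
import pandas as pd

# TIME OCC is 24-hour military time as an integer (e.g. 330 -> 03:30, 40 -> 00:40).
time_occ = pd.Series([330, 1200, 1315, 40, 2130])

hours = time_occ // 100  # integer division drops the minutes
print(hours.tolist())  # [3, 12, 13, 0, 21]
```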
y_filtered['AREA NAME'].value_counts()
Central        50920
77th Street    48031
Pacific        44262
Southwest      42479
Hollywood      40201
Southeast      38863
Olympic        38494
Newton         37842
N Hollywood    37800
Wilshire       35877
Rampart        35277
West LA        34892
Northeast      32875
Van Nuys       32278
West Valley    31834
Harbor         31524
Topanga        30822
Devonshire     30762
Mission        30229
Hollenbeck     28563
Foothill       25491
Name: AREA NAME, dtype: int64
courses = list(incidents.keys())
values = list(incidents.values())
fig = plt.figure(figsize = (10, 5))
# creating the bar plot
plt.bar(courses, values, color ='maroon',
width = 0.4)
plt.title("Incidents")
plt.xticks(rotation=45)
plt.show()
Okay, so we should look into the area with the most reported incidents in LA: Central. Let's give it a shot.
dfCen = y_filtered[(y_filtered['AREA NAME'] == 'Central')]
dfCen.head(6)
| | DR_NO | Date Rptd | DATE OCC | TIME OCC | AREA | AREA NAME | Rpt Dist No | Part 1-2 | Crm Cd | Crm Cd Desc | ... | Status | Status Desc | Crm Cd 1 | Crm Cd 2 | Crm Cd 3 | Crm Cd 4 | LOCATION | Cross Street | LAT | LON |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 190101086 | 01/02/2020 12:00:00 AM | 01/01/2020 12:00:00 AM | 330 | 1 | Central | 163 | 2 | 624 | BATTERY - SIMPLE ASSAULT | ... | IC | Invest Cont | 624.0 | NaN | NaN | NaN | 700 S HILL ST | NaN | 34.0459 | -118.2545 |
| 2 | 200110444 | 04/14/2020 12:00:00 AM | 02/13/2020 12:00:00 AM | 1200 | 1 | Central | 155 | 2 | 845 | SEX OFFENDER REGISTRANT OUT OF COMPLIANCE | ... | AA | Adult Arrest | 845.0 | NaN | NaN | NaN | 200 E 6TH ST | NaN | 34.0448 | -118.2474 |
| 5 | 200100501 | 01/02/2020 12:00:00 AM | 01/01/2020 12:00:00 AM | 30 | 1 | Central | 163 | 1 | 121 | RAPE, FORCIBLE | ... | IC | Invest Cont | 121.0 | 998.0 | NaN | NaN | 700 S BROADWAY | NaN | 34.0452 | -118.2534 |
| 6 | 200100502 | 01/02/2020 12:00:00 AM | 01/02/2020 12:00:00 AM | 1315 | 1 | Central | 161 | 1 | 442 | SHOPLIFTING - PETTY THEFT ($950 & UNDER) | ... | IC | Invest Cont | 442.0 | 998.0 | NaN | NaN | 700 S FIGUEROA ST | NaN | 34.0483 | -118.2631 |
| 7 | 200100504 | 01/04/2020 12:00:00 AM | 01/04/2020 12:00:00 AM | 40 | 1 | Central | 155 | 2 | 946 | OTHER MISCELLANEOUS CRIME | ... | IC | Invest Cont | 946.0 | 998.0 | NaN | NaN | 200 E 6TH ST | NaN | 34.0448 | -118.2474 |
| 8 | 200100507 | 01/04/2020 12:00:00 AM | 01/04/2020 12:00:00 AM | 200 | 1 | Central | 101 | 1 | 341 | THEFT-GRAND ($950.01 & OVER)EXCPT,GUNS,FOWL,LI... | ... | IC | Invest Cont | 341.0 | 998.0 | NaN | NaN | 700 BERNARD ST | NaN | 34.0677 | -118.2398 |
6 rows × 28 columns
CIncCen = [i.split(" ") for i in dfCen.Mocodes.value_counts().to_dict().keys()]
CIncCen[:10]
[['0344'], ['0329'], ['0344', '1822'], ['1822', '0344'], ['1501'], ['0329', '1822'], ['0416', '1822'], ['0416'], ['0325'], ['1609', '0344', '1307', '1822', '0329']]
DictCIncCen = {}
for i in CIncCen:
    for j in i:
        if j in DictCIncCen:
            DictCIncCen[j] = DictCIncCen[j] + 1
        else:
            DictCIncCen[j] = 1
mK = max(DictCIncCen, key=lambda x:DictCIncCen[x])
Let's peek at the first few MO codes and their counts in our dictionary.
l = list(DictCIncCen.keys())
# note: len(l) % 10 is the remainder (4 here), not 10% of the list -- use // for that
for i in range(len(l) % 10):
    print(f'key: {l[i]}, value: {DictCIncCen[l[i]]}')
key: 0344, value: 9012
key: 0329, value: 4104
key: 1822, value: 15707
key: 1501, value: 640
print(f'Most common Incident: "{mK}", number of reports: "{DictCIncCen.get(mK)}"')
Most common Incident: "1822", number of reports: "15707"
In human eyes this would be:
print(f'Most common Incident: {output_dict[mK]}, number of reports: {DictCIncCen.get(mK)}')
Most common Incident: Stranger, number of reports: 15707
Looks like stranger danger is a bigger deal than we imagined. Let's replace the Mocodes with a human-readable format and sort it so you can take a peek at it.
HDCincCen = { output_dict[key]: val for key,val in DictCIncCen.items()}
Now that that's done, let's print the first 10 values. We'll work on sorting it after.
list(HDCincCen.keys())[:10]
['Removes vict property', 'Vandalized', 'Stranger', 'Other MO (see rpt)', 'Hit-Hit w/ weapon', 'Took merchandise', 'Smashed', 'Breaks window', 'Victim left property unattended', "Takes vict's identification/driver license"]
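Sorting that dictionary by count is one `sorted` call; sketched on a toy slice of the counts above:

```python
# Toy slice of the human-readable counts; the real data is HDCincCen.
counts = {'Removes vict property': 9012, 'Vandalized': 4104,
          'Stranger': 15707, 'Other MO (see rpt)': 640}

# Sort descending by count; dict() preserves insertion order in Python 3.7+.
sorted_counts = dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))
print(list(sorted_counts)[:2])  # ['Stranger', 'Removes vict property']
```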
courses = list(HDCincCen.keys())
values = list(HDCincCen.values())
fig = plt.figure(figsize = (10, 5))
# creating the bar plot
plt.bar(courses, values, color ='maroon',
width = 0.4)
plt.title("Incidents")
plt.xticks(rotation=45)
plt.show()
Obviously we have a problem here: far too many categories to read. Let's drop all values less than 1000.
HDCincCen2 = {}
for key, val in HDCincCen.items():
    if val >= 1000:
        HDCincCen2[key] = val
len(list(HDCincCen2.values()))
26
courses = list(HDCincCen2.keys())
values = list(HDCincCen2.values())
fig = plt.figure(figsize = (10, 5))
# creating the bar plot
plt.bar(courses, values, color ='maroon',
width = 0.4)
plt.title("Incidents")
plt.xticks(rotation=90)
plt.show()
How many values do we have? Let's check.
len(HDCincCen2)
26
What next? Let's do this for every one.
# from IPython.display import display
# from ipywidgets import Dropdown
# %matplotlib inline
output_dict['2207'] = 'Amusement Park' # for some reason this was missed.
output_dict['2100'] = 'Observation/Surveillance'
areas = [key for key, value in sorted(incidents.items(), key=lambda item: item[1])]
for area in areas:
    dfCen = y_filtered[(y_filtered['AREA NAME'] == area)]
    CIncCen = [m.split(" ") for m in dfCen.Mocodes.value_counts().to_dict().keys()]
    DictCIncCen = {}
    for codes in CIncCen:
        for code in codes:
            if code in DictCIncCen:
                DictCIncCen[code] = DictCIncCen[code] + 1
            else:
                DictCIncCen[code] = 1
    HDCincCen = {}
    for key, val in DictCIncCen.items():
        try:
            HDCincCen[output_dict[key]] = val
        except KeyError:
            pass  # skip codes missing from the PDF table
    # HDCincCen = {output_dict[key]: val for key, val in DictCIncCen.items()}
    HDCincCen2 = {}
    for key, val in HDCincCen.items():
        if val >= 1000:
            HDCincCen2[key] = val
    courses = list(HDCincCen2.keys())
    values = list(HDCincCen2.values())
    fig = plt.figure(figsize=(10, 5))
    # creating the bar plot, titled per area so the 21 charts are distinguishable
    plt.bar(courses, values, color='maroon', width=0.4)
    plt.title(f"Incidents - {area}")
    plt.xticks(rotation=90)
    plt.show()
# def dropdown_eventhandler(change):
# print(change.new)
# dfCen = y_filtered[(y_filtered['AREA NAME'] == change.new)]
# CIncCen = [i.split(" ") for i in dfCen.Mocodes.value_counts().to_dict().keys()]
# DictCIncCen = {}
# for i in CIncCen:
# for j in i:
# if j in DictCIncCen:
# DictCIncCen[j] = DictCIncCen[j] + 1
# else:
# DictCIncCen[j] = 1
# HDCincCen = {}
# for key,val in DictCIncCen.items():
# try:
# HDCincCen[output_dict[key]] = val
# except:
# next
# # HDCincCen = { output_dict[key]: val for key,val in DictCIncCen.items()}
# HDCincCen2 = {}
# for key,val in HDCincCen.items():
# if val >= 1000:
# HDCincCen2[key] = val
# courses = list(HDCincCen2.keys())
# values = list(HDCincCen2.values())
# fig = plt.figure(figsize = (10, 5))
# # creating the bar plot
# plt.bar(courses, values, color ='maroon',
# width = 0.4)
# plt.title("Incidents")
# plt.xticks(rotation=90)
# plt.show()
# dropdown = Dropdown(description="Choose one:", options=areas)
# dropdown.observe(dropdown_eventhandler, names='value')
# display(dropdown)
We can also do the same for the "crime committed description" category. Note that for the majority of the crimes, the victim knew the suspect.
y_filtered['Crm Cd Desc'].value_counts()
VEHICLE - STOLEN 81360
BATTERY - SIMPLE ASSAULT 60044
THEFT OF IDENTITY 49051
BURGLARY FROM VEHICLE 46902
VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS) 46589
...
INCEST (SEXUAL ACTS BETWEEN BLOOD RELATIVES) 4
PICKPOCKET, ATTEMPT 3
FAILURE TO DISPERSE 2
DISHONEST EMPLOYEE ATTEMPTED THEFT 2
INCITING A RIOT 1
Name: Crm Cd Desc, Length: 138, dtype: int64
value_counts = y_filtered['Crm Cd Desc'].value_counts()
value_counts[(value_counts > 20000)]
VEHICLE - STOLEN                                            81360
BATTERY - SIMPLE ASSAULT                                    60044
THEFT OF IDENTITY                                           49051
BURGLARY FROM VEHICLE                                       46902
VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS)     46589
BURGLARY                                                    46296
ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT              43718
THEFT PLAIN - PETTY ($950 & UNDER)                          38792
INTIMATE PARTNER - SIMPLE ASSAULT                           38195
THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER)             29566
THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND OVER)         27926
ROBBERY                                                     26065
THEFT-GRAND ($950.01 & OVER)EXCPT,GUNS,FOWL,LIVESTK,PROD    24458
VANDALISM - MISDEAMEANOR ($399 OR UNDER)                    20759
Name: Crm Cd Desc, dtype: int64
Overall it looks like the largest amount of reported crime is due to theft.
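One way to quantify that: sum the categories whose description mentions theft-like keywords. Sketched on a toy slice of the counts above (the keyword list is my own guess, not an official grouping):

```python
import pandas as pd

# Toy slice of the Crm Cd Desc value counts shown above.
counts = pd.Series({
    'VEHICLE - STOLEN': 81360,
    'BATTERY - SIMPLE ASSAULT': 60044,
    'THEFT OF IDENTITY': 49051,
    'BURGLARY FROM VEHICLE': 46902,
})

# Flag descriptions containing any theft-like keyword (assumed grouping).
theft_mask = counts.index.str.contains('THEFT|STOLEN|BURGLARY', regex=True)
theft_total = counts[theft_mask].sum()
print(theft_total)  # 177313
```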
areas = [key for key, value in sorted(incidents.items(), key=lambda item: item[1])]
for ar in areas:
    dfCen = y_filtered[(y_filtered['AREA NAME'] == ar)]
    CIncCen = [a.split(" ") for a in dfCen.Mocodes.value_counts().to_dict().keys()]
    DictCIncCen = {}
    for i in CIncCen:
        for j in i:
            if j in DictCIncCen:
                DictCIncCen[j] = DictCIncCen[j] + 1
            else:
                DictCIncCen[j] = 1
    HDCincCen = {}
    for key, val in DictCIncCen.items():
        try:
            HDCincCen[output_dict[key]] = val
        except KeyError:
            pass  # skip codes missing from the PDF table
    # HDCincCen = {output_dict[key]: val for key, val in DictCIncCen.items()}
    HDCincCen2 = {}
    for key, val in HDCincCen.items():
        if val >= 1000:
            HDCincCen2[key] = val
    keys = list(HDCincCen2.keys())
    count_vals = list(HDCincCen2.values())
    df = pd.DataFrame({'keys': keys, 'count': count_vals})
    df = df.sort_values(by='count', ascending=False)
    print(f'Area: {ar}')
    print(df.head(3))
Area: Foothill
keys count
3 Victim knew Suspect 3224
2 Hit-Hit w/ weapon 2393
1 Stranger 2105
Area: Hollenbeck
keys count
1 Victim knew Suspect 3567
2 Force used 2483
5 Stranger 2181
Area: Mission
keys count
6 Victim knew Suspect 3898
2 Stranger 2847
1 Hit-Hit w/ weapon 2622
Area: Devonshire
keys count
14 Victim knew Suspect 5238
0 Removes vict property 4055
3 Stranger 3607
Area: Topanga
keys count
0 Removes vict property 2554
2 Stranger 2389
1 Hit-Hit w/ weapon 1808
Area: Harbor
keys count
4 Victim knew Suspect 6655
0 Removes vict property 3878
3 Stranger 3659
Area: West Valley
keys count
0 Stranger 9048
5 Victim knew Suspect 6230
1 Removes vict property 4676
Area: Van Nuys
keys count
0 Removes vict property 3729
1 Stranger 3637
8 Photographs 2268
Area: Northeast
keys count
2 Stranger 4135
5 Victim knew Suspect 3153
0 Removes vict property 2935
Area: West LA
keys count
1 Stranger 8192
0 Removes vict property 6101
10 Video surveillance booked/available 1742
Area: Rampart
keys count
1 Stranger 10315
7 Victim knew Suspect 6743
0 Removes vict property 5065
Area: Wilshire
keys count
2 Stranger 4274
0 Removes vict property 3264
4 Hit-Hit w/ weapon 2179
Area: N Hollywood
keys count
1 Stranger 3360
0 Removes vict property 3229
3 Hit-Hit w/ weapon 2753
Area: Newton
keys count
4 Stranger 6440
5 Victim knew Suspect 5858
0 Removes vict property 4421
Area: Olympic
keys count
0 Stranger 11662
3 Victim knew Suspect 6145
1 Removes vict property 5989
Area: Southeast
keys count
4 Victim knew Suspect 10760
0 Stranger 10009
5 Hit-Hit w/ weapon 6204
Area: Hollywood
keys count
0 Stranger 10857
6 Force used 8179
1 Removes vict property 5029
Area: Southwest
keys count
1 Stranger 12726
7 Victim knew Suspect 8727
0 Removes vict property 6993
Area: Pacific
keys count
1 Stranger 3220
0 Removes vict property 1964
3 Force used 1964
Area: 77th Street
keys count
4 Victim knew Suspect 10414
2 Stranger 9938
6 Hit-Hit w/ weapon 7045
Area: Central
keys count
2 Stranger 15707
0 Removes vict property 9012
3 Hit-Hit w/ weapon 6216
It appears that in the safer areas, the victim knew the suspect. In the areas with higher reported instances of crime, this is not the case.
What about the general area in which these incidents occur? We'll take a look at the streets, and maybe cross streets as well. We're expecting a ton of them, but we'll count them anyway for maybe some surprising results.
We don't necessarily need to break these apart, since all streets should be unique. We'll also sort by cross streets later.
incidents
{'Central': 50920,
'77th Street': 48031,
'Pacific': 44262,
'Southwest': 42479,
'Hollywood': 40201,
'Southeast': 38863,
'Olympic': 38494,
'Newton': 37842,
'N Hollywood': 37800,
'Wilshire': 35877,
'Rampart': 35277,
'West LA': 34892,
'Northeast': 32875,
'Van Nuys': 32278,
'West Valley': 31834,
'Harbor': 31524,
'Topanga': 30822,
'Devonshire': 30762,
'Mission': 30229,
'Hollenbeck': 28563,
'Foothill': 25491}
len(y_filtered.LOCATION.unique())
62662
Okay so we have 62k unique streets. Probably not worth checking. What about cross streets?
len(y_filtered['Cross Street'].unique())
9427
This is quite a bit. I'd argue it's not worth looking at either as the surface area is too large.
# use a new name rather than clobbering `y`, which still holds the raw DataFrame
n_cross = len(y_filtered[(y_filtered['AREA NAME'] == 'Foothill')]['Cross Street'].unique())
print(n_cross)
print((n_cross / incidents['Foothill']) * 100)
597
2.3420030599034956
Looks like the unique cross streets amount to about 2% of the reported incidents in our safest area.
n_cross = len(y_filtered[(y_filtered['AREA NAME'] == 'Central')]['Cross Street'].unique())
print(n_cross)
print((n_cross / incidents['Central']) * 100)
552
1.0840534171249017
and about 1% for our busiest.
Okay so it looks like cross streets don't really matter.
To be straight to the point, that's it. There are some things I could still do with this data, particularly in regards to predictions. I feel like there's more in here, but right now I have to set this project aside. If you want to learn more, you can always check out this article: https://www.newswise.com/pdf_docs/165643264647045_crime_nature_human_behavior.pdf